Introduction
Correctly classifying a person as being at risk for heart failure is essential to strong preventive health care. Contributing factors are known to include coronary artery disease, hypertension, obesity, infection, certain medications, and lifestyle choices such as alcohol consumption, smoking, and drug abuse. Understanding the influence of these competing factors is made easier through machine learning, which can find patterns and insights in data without explicit human guidance or instruction. Random Forest is a promising machine learning model because it classifies large datasets accurately, is resistant to outliers, and is easy to use. In this report we validate the application of Random Forest machine learning to support the accurate classification of heart failure among 299 patient profiles.
Classification with Decision Trees
At the heart of the Random Forest model is the decision tree. A decision tree “is a unidirectional, compact, tree-like data structure where each branch point is an attribute-data test that partitions the data rows into two or more branches” (Sheppard 2017). An optimized tree is shallow and has few branches. A decision tree may be used to perform classification or regression analysis to predict a future value given particular features of a subject.
Yuan et al. state the most influential decision tree algorithm is Iterative Dichotomiser 3 (ID3) (Yuan et al. 2015). ID3 is presented by Quinlan in his seminal 1986 paper Induction of Decision Trees (Quinlan 1986). That paper lays the foundation for using machine learning to address the knowledge-acquisition bottleneck of expert systems. Quinlan demonstrates it is possible to use decision trees to generate knowledge and solve difficult problems through automation without much computation (Quinlan 1986). Low levels of noise do not cause the tree building to “fall over a cliff” and fail (Quinlan 1986). Furthermore, training the decision tree on data that contains noise has been found to improve results when the data to be classified is itself noisy (Quinlan 1986). In addition, unbalanced trees tend to perform statistically better than those balanced through stratification (Mascaro et al. 2014).
A decision tree does not have domain knowledge (Sheppard 2017). It begins as a data structure with a single node that divides the dataset into subsets according to a partitioning algorithm that learns from exposure to training data. The training data is a smaller representation of the main dataset and contains examples of each class of datapoint found in it. As the decision tree processes the training data, it grows branches and additional nodes, recursively partitioning subsets into smaller subsets until the process terminates with each datapoint assigned to the category at a leaf node. This automated machine learning algorithm is challenged by poor partitioning, noise, and long compute time (Sheppard 2017).
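The recursive-partitioning process described above can be sketched in a few lines of R. This is an illustration only, using the rpart package and the built-in iris dataset as a stand-in for a classification dataset; it is not the model used later in this report.

```r
# A minimal decision-tree sketch: rpart learns branch points (attribute tests)
# from training data and assigns each row to a class at a leaf node.
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)                      # the learned branch points and leaf classes

pred <- predict(fit, iris, type = "class")
mean(pred == iris$Species)      # fraction of rows the tree classifies correctly
```

The printed tree shows how each branch point partitions the rows, mirroring the description above.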
Poor partitioning results in elongated decision trees with impure subsets. An impure subset occurs when a branch point does not cleanly separate the classes. Impure subsets lead to incorrect classifications, and ultimately to poor heart failure classification in a preventive healthcare workflow.
Noise is another challenge facing machine learning with decision trees. Noise is any impurity found within the dataset. It is random, unpredictable, and unexpected data that disrupts the ability to correctly partition data into categories. It is what some call entropy (Sheppard 2017). The more entropy there is, the more challenging it is to correctly classify each datapoint.
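Entropy in this sense can be made precise. A common formulation (the one used by ID3, offered here as background rather than as Sheppard's own definition) is the Shannon entropy of a set \(S\) partitioned into \(k\) classes:

```latex
H(S) = -\sum_{i=1}^{k} p_i \log_2 p_i
```

where \(p_i\) is the proportion of datapoints in class \(i\). A pure subset has \(H(S) = 0\), while entropy is maximal when the classes are evenly mixed, which is exactly when correct partitioning is hardest.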
Long compute time is influenced by both the size and the complexity of the classification dataset. Complex datasets with more categories require a decision tree with more nodes and more branches. Deeper and wider decision trees require more compute time for processing, and pushing more data through a larger, more complex tree has a multiplying effect on the compute time.
Classification with Random Forest
Leo Breiman addresses poor partitioning, noise, and long compute time with his proposal of the Random Forest model (Breiman 2001). Random Forest is a group of decision trees (a forest) created from identically distributed, independent random samples of data drawn with replacement from the original dataset (Breiman 2001). Breiman calls this random sampling process a bootstrap, which others have re-worded as bootstrapping. At each node, a random subset of variables is drawn from the array of features and the one that provides the best split is selected for that node. This randomness provides robustness in the presence of misclassifications and noise (Breiman 2001; Moradi et al. 2024; Casanova et al. 2014). This classification by aggregation is known as ensemble learning and is a proven approach for dealing with bad data (Sheppard 2017). Bad data, however, cannot include missing values for response or predictor variables (Brieuc et al. 2018).
Because each decision tree is trained on a different random sample of data, each grows into a unique structure with a unique classification result. The final classification from Random Forest is the result of aggregating the results from the individual decision trees, a process Breiman calls bagging (Breiman 2001). This bagging technique addresses both the noise and the poor partitioning that make traditional decision trees inaccurate. By using multiple trees and aggregating their results, the effects of noise and poor partitioning are minimized and accuracy is increased (Breiman 2001). Furthermore, bagging can be used to calculate an “ongoing estimate of the generalization error” (Breiman 2001), that is, the expected error rate on data the model has not seen, estimated internally from the out-of-bag samples. This can be used to gauge how well the model is likely to perform during classification.
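The bootstrap-and-vote process described above can be sketched by hand. The following is an illustration only (randomForest automates all of this internally), again using rpart trees and iris as a stand-in dataset:

```r
# A sketch of bagging (bootstrap aggregating): train several trees, each on a
# bootstrap sample drawn with replacement, then classify by majority vote.
library(rpart)

set.seed(42)
n_trees <- 25
trees <- lapply(seq_len(n_trees), function(i) {
  # Bootstrap sample: same size as the data, rows drawn with replacement
  boot_rows <- sample(nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[boot_rows, ], method = "class")
})

# Aggregate: each tree votes, and the majority class wins for each row
votes  <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == iris$Species)
```

Because each tree sees a slightly different bootstrap sample, individual mistakes tend to be outvoted, which is the mechanism by which bagging dampens noise and poor partitioning.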
Accuracy is optimized through a process called boosting. Boosting is a technique in which more weight is given to the votes from decision trees that predict correctly than to the votes from those that do not. Sheppard states “boosting earned a 1-2 percent overall improvement in the ability to predict the correct outcome while also achieving the percentage more consistently” (Sheppard 2017). Accuracy is also improved through a heuristic known as Ant Colony Optimization (ACO) (Boryczka and Kozak 2011). This approach is inspired by ant colony behavior, in which individual ants leave traces of pheromone to communicate to other ants regarding the best path to take. Similarly, the ACO algorithm improves dataset splitting by feeding the results of earlier splits back into subsequent splitting choices. By improving the choice of splitting criteria, one achieves “maximum homogeneity in the decision classes,” which improves overall decision tree accuracy (Boryczka and Kozak 2011).
An important feature of Random Forest is its ability to provide measures of how strongly a given variable is associated with a classification result. These measures include the raw importance score for class 0, the raw importance score for class 1, the decrease in accuracy, and the Gini Index (Esmaily et al. 2018). The Gini Index is explained later. These scores help users fine-tune the Random Forest method so that optimal classification results are achieved.
There are many examples of Random Forest used as a successful and scalable classifier. In one study, researchers created a Random Forest model to predict in-hospital mortality for acute kidney injury patients in the intensive care unit (Lin, Hu, and Kong 2019). Their Random Forest model displays the lowest Brier score (associated with accuracy) and the largest AUROC (associated with discrimination) of the compared models, along with the second-best F1 score and accuracy.
In another experiment, researchers developed a random forest regression model capable of accurately predicting future estimated glomerular filtration rates using electronic medical record data (Zhao, Gu, and McDermaid 2019). This tool allows identification of chronic kidney disease in its early stages, making it more likely that progression to end-stage renal disease can be prevented. The random forest models created are found to have accuracies of about 90% for stages 2 and 3, and about 80% for stages 4 and 5 of chronic kidney disease.
Scientists also investigate the accuracy of the Random Forest algorithm for the classification of gait as displaying or lacking the characteristics of hemiplegia, which is the paralysis of one side of the body (Luo et al. 2020). Researchers achieve a classification accuracy of 95.45%, which they state “is superior than all existing methods.” Because of the natural tendency for Random Forest to be resistant to over-fitting, the researchers “do not have to have careful control over the percentage of patients in the training data” (Luo et al. 2020).
When the Random Forest model is applied to the classification of persons with type 1 and type 2 diabetes, the conclusion is that “the types of attributes used have an important role in the classification process” (Rachmawanto et al. 2021). In their study, Random Forest executed with the Abel Vika dataset achieves 100% accuracy, 100% precision, and 100% recall. In contrast, when the Pima Indian Diabetes Dataset is used, the results show at most scores of 75-85% for accuracy, precision, and recall. The conclusion is that Random Forest excels when the data is noisy and unbalanced as is the case with the first dataset.
In practical applications, Random Forest’s scalability and parallelization capabilities are noteworthy, allowing it to efficiently process large datasets using parallel computing techniques like MapReduce (Yuan et al. 2015). This scalability makes random forest suitable for tasks ranging from clinical prediction in healthcare to large-scale genetic analyses, where handling big data is essential for achieving accurate and reliable results.
Random Forest provides valuable insights into feature importance, aiding researchers in understanding which variables contribute most significantly to predictions. This attribute is critical in applications where identifying key factors (such as genetic markers in genome-wide association studies, environmental variables in ecological analyses, or clinical predictors of heart failure) is paramount. Since the Random Forest model shows robustness in the face of noisy datasets, it is an ideal algorithm for studies that involve numerous features, such as those found in the biomedical sciences. Its methodology is also more easily conveyed to and understood by medical professionals than many other machine learning models, allowing greater promise of successful real-world implementation. In other study types, Random Forest is chosen for its simplicity and diversity, making it one of the most commonly used algorithms for solar energy prediction (Wang et al. 2020).
In the following sections we explain the Methods pertaining to Random Forest. This includes the algorithm, measuring node purity with the Gini Index, and aggregation of classification votes from individual decision trees via a process called bagging. The Analysis and Results section presents the classification of the Heart Failure Clinical Records dataset processed with R code and explains how the algorithm is tuned for optimized results (“Heart Failure Clinical Records” 2020). Lastly, the Conclusion ties all of this work together and crowns Random Forest as a promising machine learning model for classifying people at risk for heart failure.
Analysis and Results
In the following section we ingest the data into RStudio, perform classification with Random Forest, and analyze the results. Raw output is displayed for completeness. We typeset the output when appropriate and use visualization techniques to enhance understanding. We comment extensively in an effort to explain each step of the analytic process.
The Heart Failure Dataset
The Random Forest classification method is applied to the Heart Failure Clinical Records dataset provided by the University of California Irvine Machine Learning Repository (“Heart Failure Clinical Records” 2020). The dataset comes from the Faisalabad Institute of Cardiology and the Allied Hospital in Faisalabad, Pakistan, and consists of the medical records of 299 patients who experienced heart failure, with 13 features per record (Chicco and Jurman 2020). We use this dataset to create and evaluate a Random Forest model for predicting heart failure events.
The features of the Heart Failure Clinical Records dataset are provided in the table below:
| Feature Name | Type | Description |
| --- | --- | --- |
| age | integer | Age of the patient |
| anaemia | binary* | Decrease of red blood cells or hemoglobin |
| creatinine_phosphokinase | integer | Level of the CPK enzyme in the blood |
| diabetes | binary* | If the patient has diabetes |
| ejection_fraction | integer | Percentage of blood leaving the heart at each contraction |
| high_blood_pressure | binary* | If the patient has hypertension |
| platelets | continuous | Platelets in the blood |
| serum_creatinine | continuous | Level of serum creatinine in the blood |
| serum_sodium | integer | Level of serum sodium in the blood |
| sex | binary (0 = female, 1 = male) | Woman or man |
| smoking | binary* | If the patient smokes or not |
| time | integer | Follow-up period in days |
| DEATH_EVENT | binary* | If the patient died during the follow-up period |

\* Binary features; 0 = “no” and 1 = “yes”.
Source: Heart Failure Clinical Records - UCI Machine Learning Repository (“Heart Failure Clinical Records” 2020)
Software and Hardware Configuration
The R programming language is used to support data processing and analysis as it is the standard for statistical programming and graphics.
RStudio Pro, 2023.12.0, Build 369.pro3 is used. Various R libraries and packages are employed, including caTools, randomForest, tidyverse, readr, dplyr, party, caret, corrplot, tidyr, ggplot2, DT, and pROC.
The RStudio Server runs on a RHEL 9 based virtual machine within a VMware vSphere HA cluster. The VM has 50 vCPUs and 196 GB of RAM assigned.
The physical cluster nodes are Dell PowerEdge R750 servers with dual Xeon Gold 6338N (32-core) CPUs, 512 GB of RAM, and SFP28 25 Gbit networking for all communications (6 SFP28 ports per server). Cluster storage is provided by an NVMe-based Dell SAN.
Model Evaluation Metrics
There are a number of metrics available for assessing a classification model’s performance. The metrics used are described in the following text:
OOB error rate/accuracy: The Out-of-Bag (OOB) data is the data left unused by an individual predictor (decision tree) after a bootstrap sample is taken from the original dataset. The OOB data is used for internal validation: the prediction error rate is measured by running this “unseen” data through each tree, and the results are averaged to provide the overall OOB error rate (Breiman 2001). A model’s accuracy is calculated by subtracting the error rate from 100%.
A confusion matrix shows the values of True Negative, False Negative, False Positive, and True Positive predictions produced by a model. Several performance metrics for classification algorithms like the random forest method are presented by the confusionMatrix function and derived from the values provided in confusion matrices. Here we focus on precision, recall, the F1 score, and balanced accuracy.
- Precision is also known as the positive predictive value and is calculated by dividing the number of true positive predictions by the sum of the true positives and false positives. This metric quantifies how many of the observations labeled as positive are actually positive (Raschka, Liu, and Mirjalili 2022).
- Recall is also known as the sensitivity or true positive rate and is calculated by dividing the number of true positive predictions by the sum of the false negatives and true positives. It quantifies how many of the positive observations have been predicted as positive (Raschka, Liu, and Mirjalili 2022).
- The F1 score is the harmonic mean of the precision and recall and is used to assess predictive performance (Raschka, Liu, and Mirjalili 2022).
- Balanced accuracy is the average accuracy of classification for both classes. It is given by the average of the true positive rate and the true negative rate. It aids in reducing the effects of class imbalance in the data set so that accuracy of predicting the dominant class does not conceal the classification accuracy for the minority class (Brodersen et al. 2010).
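The four metrics above can be computed directly from raw confusion-matrix counts. As a sketch, the following uses the counts from the confusion matrix reported later in this report (TP = 12, FP = 5, FN = 6, TN = 37, with death events as the positive class):

```r
# Precision, recall, F1, and balanced accuracy from confusion-matrix counts
TP <- 12; FP <- 5; FN <- 6; TN <- 37

precision   <- TP / (TP + FP)                 # positive predictive value
recall      <- TP / (TP + FN)                 # sensitivity / true positive rate
f1          <- 2 * precision * recall / (precision + recall)
specificity <- TN / (TN + FP)                 # true negative rate
balanced_accuracy <- (recall + specificity) / 2

round(c(precision, recall, f1, balanced_accuracy), 4)
# 0.7059 0.6667 0.6857 0.7738
```

These hand-computed values match those printed by the confusionMatrix function in the Analysis and Results section.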
AUC-ROC: Area under the receiver operating characteristic curve. In an ROC curve, the model’s true positive rate (on the Y-axis) is plotted against the false positive rate (on the X-axis) at varying classification thresholds (Huang and Ling 2005). An ideal ROC curve has a high true positive rate and low false positive rate, falling into the upper left area of the graph; this equates to an AUC value close to 1. A diagonal line on the graph represents an AUC of 0.5, equivalent to random guessing of the classifications (Raschka, Liu, and Mirjalili 2022).
Loading and Examining the Dataset
The heart failure dataset is loaded from a .csv file.
The dataset is checked for any null values, and none are found. Null values can cause errors in calculations and need to be addressed, often by removing them or imputing a value for them (a median value, a random value, or something else).
# A tibble: 1 × 13
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<int> <int> <int> <int> <int>
1 0 0 0 0 0
# ℹ 8 more variables: high_blood_pressure <int>, platelets <int>,
# serum_creatinine <int>, serum_sodium <int>, sex <int>, smoking <int>,
# time <int>, DEATH_EVENT <int>
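A per-column null count like the tibble above can be produced with a dplyr pattern along these lines. The data frame name and columns here are illustrative stand-ins (with one NA planted so the counting is visible); the real dataset produces a row of zeros:

```r
# Count NA values in every column of a data frame
library(dplyr)

# Stand-in for the heart failure tibble (names assumed from the output above)
heart_data <- tibble(age = c(63, 50, NA), anaemia = c(1, 1, 0))

na_counts <- heart_data %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_counts   # one row: the number of NAs per column
```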
The target variable (DEATH_EVENT) is then examined for the balance between the negative and positive classes.
# A tibble: 2 × 2
DEATH_EVENT n
<dbl> <int>
1 0 203
2 1 96
The DEATH_EVENT feature is converted to a factor feature since it is the binary dependent variable.
The number of negative events (survival during follow-up) is found to be about double the number of positive events (death events). This imbalance is not considered significant, as the Random Forest algorithm is robust enough to tolerate this asymmetry.
Running the Random Forest Model - Default Settings
The patient health records dataset is split into a training set (80%) and a testing set (20%). An id column is created to assist in the split.
# A tibble: 6 × 13
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <dbl> <dbl> <dbl> <dbl>
1 63 1 122 1 60
2 50 1 168 0 38
3 45 0 582 0 20
4 85 1 102 0 60
5 65 0 56 0 25
6 73 1 1185 0 40
# ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
# time <dbl>, DEATH_EVENT <fct>
# A tibble: 6 × 13
age anaemia creatinine_phosphokinase diabetes ejection_fraction
<dbl> <dbl> <dbl> <dbl> <dbl>
1 55 0 7861 0 38
2 65 0 146 0 20
3 62 0 231 0 25
4 49 1 80 0 30
5 70 1 125 0 25
6 82 1 855 1 50
# ℹ 8 more variables: high_blood_pressure <dbl>, platelets <dbl>,
# serum_creatinine <dbl>, serum_sodium <dbl>, sex <dbl>, smoking <dbl>,
# time <dbl>, DEATH_EVENT <fct>
The training and testing sets are inspected.
239 subjects are randomly placed into the training set and the remaining 60 are designated for testing.
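The id-assisted split described above might be performed along these lines. This is a sketch under assumptions: the data frame name is illustrative, and a stand-in tibble of 299 rows replaces the real dataset so the row counts match the report:

```r
# 80/20 train/test split using an id column: sample 80% for training,
# then anti-join on id to recover the remaining rows for testing.
library(dplyr)

set.seed(123)
# Stand-in for the 299-row heart failure tibble (name is illustrative)
heart_data <- tibble(age = rnorm(299), DEATH_EVENT = rbinom(299, 1, 0.32))

heart_data  <- heart_data %>% mutate(id = row_number())
heart_train <- heart_data %>% slice_sample(prop = 0.8)
heart_test  <- heart_data %>% anti_join(heart_train, by = "id")

c(train = nrow(heart_train), test = nrow(heart_test))
# train  test
#   239    60
```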
The random forest model is run on the dataset using the default settings.
Call:
randomForest(formula = DEATH_EVENT ~ ., data = heart_train, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 3
OOB estimate of error rate: 13.39%
Confusion matrix:
0 1 class.error
0 149 12 0.07453416
1 20 58 0.25641026
The default random forest model produces an Out-of-Bag (OOB) estimate of error rate of 13.39%. There is a higher class error rate in the “positive” (death event occurrence) class than in the “negative” (no death occurrence) class: 25.6% versus 7.45%, respectively.
0 1 MeanDecreaseAccuracy
age 3.88286662 6.37702875 7.2430140
anaemia -1.27989437 0.09593522 -0.8771973
creatinine_phosphokinase -2.41143616 -0.89764178 -2.3017959
diabetes 1.24509687 0.11403068 0.9129414
ejection_fraction 9.38043916 10.88174361 13.7073591
high_blood_pressure -0.02858881 -0.71068638 -0.6327201
platelets -0.72949140 2.05533386 0.8103532
serum_creatinine 13.46621216 9.76639727 16.4907961
serum_sodium 1.08707019 6.40982545 5.1245453
sex 0.76640143 -0.41681507 0.3513998
smoking 1.70113069 -2.11448236 0.1374946
time 42.54407306 38.39006972 47.7292136
MeanDecreaseGini
age 8.964587
anaemia 1.303405
creatinine_phosphokinase 7.631828
diabetes 1.224129
ejection_fraction 10.990121
high_blood_pressure 1.253907
platelets 8.443596
serum_creatinine 15.067978
serum_sodium 7.372099
sex 1.139253
smoking 1.215850
time 39.857407
The Variable Importance Plot above is a visual representation of the importance of each variable in our model. Each variable is plotted at a position determined by its mean decrease in accuracy and its mean decrease in Gini Index. The resulting value indicates how useful a variable is for predicting an outcome: the higher the value, the greater the influence.
In the Mean Decrease Accuracy plot, time has the highest value, at approximately 48. Because time is the follow-up period in days, shorter follow-up periods are strongly associated with death events; this should be interpreted with care, since a death event necessarily cuts the follow-up period short.
Serum_creatinine and ejection_fraction take second and third place, at roughly 16 and 14, within the Mean Decrease Accuracy plot. Serum creatinine is a waste product of muscle metabolism and is normally filtered out of the blood by healthy kidneys; a high serum_creatinine value is often an indicator of unhealthy kidneys.
Ejection fraction is the percentage of blood pumped out of a ventricle each time the heart beats. It is calculated by dividing the stroke volume by the end-diastolic volume and multiplying by 100. An ejection fraction of 50-70% is considered healthy, and a low ejection fraction is an indicator of heart failure and/or heart disease.
Although the numbers are slightly lower, time, serum_creatinine, and ejection_fraction are also the top three variables for predicting heart failure according to the Mean Decrease Gini Index plot.
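The two importance tables and the plot discussed above come directly from the randomForest package. A minimal sketch (iris stands in here, since the report's own data objects are not reproduced):

```r
# Fit a small forest and extract/plot variable importance.
# importance = TRUE stores both the permutation-based (accuracy) and
# Gini-based importance measures.
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

importance(rf)   # per-class scores, MeanDecreaseAccuracy, MeanDecreaseGini
varImpPlot(rf)   # the two-panel Variable Importance Plot
```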
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 37 6
1 5 12
Accuracy : 0.8167
95% CI : (0.6956, 0.9048)
No Information Rate : 0.7
P-Value [Acc > NIR] : 0.02948
Kappa : 0.5565
Mcnemar's Test P-Value : 1.00000
Sensitivity : 0.6667
Specificity : 0.8810
Pos Pred Value : 0.7059
Neg Pred Value : 0.8605
Precision : 0.7059
Recall : 0.6667
F1 : 0.6857
Prevalence : 0.3000
Detection Rate : 0.2000
Detection Prevalence : 0.2833
Balanced Accuracy : 0.7738
'Positive' Class : 1
The visualization above is called a FourFold Plot, and it is used to represent a 2 x 2 contingency table to support evaluation. In this case the contingency table is the confusion matrix resulting from our model. One diagonal shows the correct classifications: the algorithm correctly classifies ‘not heart failure’ 37 times and ‘heart failure’ 12 times. The errors appear on the opposite diagonal: the algorithm incorrectly classifies ‘heart failure’ as FALSE (not heart failure) 6 times when it should be classified as TRUE, and incorrectly classifies ‘not heart failure’ as TRUE (heart failure) 5 times when it should be classified as FALSE. The colors are used to dual-encode correct prediction (blue) and incorrect prediction (yellow). The area of each quarter-circle is determined by a radius of \(\sqrt{f_{ij}}\) where \({f_{ij}}\) is the cell frequency (Maechler 2024). The message being portrayed is that 49 classifications are correct, and only 11 are incorrect. This yields an accuracy of 81.67%, as mentioned previously.
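A fourfold plot of this kind can be drawn with base R graphics. The following sketch builds the 2 x 2 table from the cell counts in the confusion matrix reported above:

```r
# Fourfold plot of the model's confusion matrix (counts from the report);
# conf.level = 0 suppresses the confidence rings.
conf <- matrix(c(37, 5, 6, 12), nrow = 2,
               dimnames = list(Prediction = c("0", "1"),
                               Reference  = c("0", "1")))

fourfoldplot(conf, color = c("yellow", "blue"), conf.level = 0)
```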
The Confusion Matrix Heatmap above shows an alternate way to visualize the performance of the algorithm regarding its ability to correctly predict an outcome. Just like with the FourFold Plot, one diagonal shows the algorithm correctly classified ‘not heart failure’ 37 times and ‘heart failure’ 12 times. The errors are shown in the remaining squares: the cases where the algorithm incorrectly classified ‘heart failure’ as FALSE (not heart failure) when it should have been classified as TRUE, and where it incorrectly classified ‘not heart failure’ as TRUE (heart failure) when it should have been classified as FALSE. The darker color gradient helps one perceive the relative amount recorded in each of the four areas.
The Variable Correlation Heatmap above shows the correlation between multiple model variables as a color-coded matrix. Model variables that are positively correlated have a blue square at their intersection, and negatively correlated model variables have a red square at their intersection. The color saturation represents the strength of the positive (blue) or negative (red) correlation. For example, the intersection of serum_creatinine along the left side with age along the top results in a light blue square representing a slightly positive correlation between these two model variables. On the other hand, serum_creatinine along the left side intersects with serum_sodium along the top, resulting in a slightly red square representing a slightly negative correlation. Other model variables, like diabetes and creatinine_phosphokinase, intersect with no color, which means there is no meaningful positive or negative correlation between them.
Variable correlation is used to explore the data to understand relationships that might otherwise go unnoticed. In some cases, understanding correlation can lead to the discovery of one variable influencing another. A review of the heatmap suggests serum_creatinine is positively associated with a death event, whereas ejection_fraction is negatively associated with it. We make neither causal claim in our analysis and offer these only as examples of one way to interpret the results of a correlation heatmap. Furthermore, we recognize that correlation does not imply causation, so we are careful not to make such a claim; further analysis and subject matter expertise would be required to arrive at such a conclusion.
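A heatmap of this kind is commonly drawn with the corrplot package (assumed here; the report lists it among its libraries). A sketch, with the built-in mtcars data standing in for the numeric heart failure features:

```r
# Correlation matrix rendered as a color-coded heatmap
library(corrplot)

corr_matrix <- cor(mtcars)                         # pairwise Pearson correlations
corrplot(corr_matrix, method = "color", type = "upper")
```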
0 1
1 0.250 0.750
2 0.096 0.904
3 0.260 0.740
4 0.326 0.674
5 0.182 0.818
6 0.158 0.842
7 0.098 0.902
8 0.266 0.734
9 0.058 0.942
10 0.128 0.872
11 0.186 0.814
12 0.390 0.610
13 0.336 0.664
14 0.888 0.112
15 0.908 0.092
16 0.866 0.134
17 0.948 0.052
18 0.596 0.404
19 0.548 0.452
20 0.966 0.034
21 0.620 0.380
22 0.970 0.030
23 0.964 0.036
24 0.324 0.676
25 0.760 0.240
26 0.436 0.564
27 0.282 0.718
28 0.978 0.022
29 0.980 0.020
30 0.950 0.050
31 0.952 0.048
32 0.912 0.088
33 0.994 0.006
34 0.940 0.060
35 0.818 0.182
36 0.690 0.310
37 0.754 0.246
38 0.902 0.098
39 0.784 0.216
40 0.728 0.272
41 0.926 0.074
42 0.752 0.248
43 0.932 0.068
44 0.694 0.306
45 0.674 0.326
46 0.810 0.190
47 0.982 0.018
48 0.832 0.168
49 0.812 0.188
50 0.520 0.480
51 0.920 0.080
52 0.972 0.028
53 0.450 0.550
54 0.992 0.008
55 0.950 0.050
56 0.736 0.264
57 0.950 0.050
58 0.540 0.460
59 0.964 0.036
60 0.908 0.092
attr(,"class")
[1] "matrix" "array" "votes"
The table above shows the classification prediction probabilities for each observation in the test dataset. Rather than just stating how each observation is classified, it is useful to state the probability of that classification as well. In this case, “0” means a classification of no heart failure and “1” means a classification of heart failure. Understanding classification probabilities provides insight into how statistically strong a prediction is. For example, record 7 above shows a probability of 0.098 for no heart failure and 0.902 for heart failure. This can be interpreted as strong confidence in this particular classification instance. On the other hand, record 58 shows probabilities of 0.540 and 0.460, which means the confidence is almost split evenly between the two possibilities.
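A probability table like the one above can be obtained from a fitted forest with predict(..., type = "prob"), which reports the fraction of trees voting for each class. A sketch with iris as a stand-in for the heart failure data:

```r
# Per-observation class probabilities from a random forest:
# each row is a test observation, each column a class, rows sum to 1.
library(randomForest)

set.seed(1)
idx   <- sample(nrow(iris), 120)                     # 120-row training split
rf    <- randomForest(Species ~ ., data = iris[idx, ])
probs <- predict(rf, iris[-idx, ], type = "prob")

head(probs)   # vote fractions, analogous to the table above
```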
[1] "Accuracy % of default random forest: 0.816666666666667"
[1] "Area under curve of default random forest: 0.858465608465608"
The ROC (receiver operating characteristic) curve above provides a graphical representation of the true positive rate along the Y-axis vs. the false positive rate along the X-axis. This is slightly different from accuracy (shown above the ROC curve), which is simply the percentage of correct predictions. The diagonal line of the ROC curve represents the performance expected if the data points are classified at random, and the closer a curve is to the top left corner, the more accurate the model; perfect classification corresponds to an area of 1. The output above shows the classification accuracy with the default hyperparameters for the randomForest package comes to 81.67%, and the area under the ROC curve is 85.85%. In this case, these are considered indications of strong performance.
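The accuracy and AUC figures above are typically computed with the pROC package (listed among the report's libraries). A sketch with illustrative response and probability vectors (the vectors here are hypothetical, not the report's data):

```r
# ROC curve and AUC with pROC: roc() pairs true labels with the predicted
# probability of the positive class; auc() integrates the resulting curve.
library(pROC)

response <- c(0, 0, 0, 1, 1, 1, 0, 1)                 # true labels (illustrative)
prob_pos <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.6, 0.3) # predicted P(class = 1)

roc_obj <- roc(response, prob_pos, quiet = TRUE)
auc(roc_obj)    # area under the curve
plot(roc_obj)   # the curve discussed above
```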
Tuning the Random Forest Model
A random search algorithm is used to select an optimum (tuned) mtry value. The mtry value is the number of randomly drawn candidate variables considered at each split; the name is conventionally read as the number of variables m “tried” at each split.
The output above confirms we have 13 variables (columns) in our dataset.
Random Forest
239 samples
12 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 215, 215, 216, 214, 215, 215, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
2 0.8675942 0.6884010
3 0.8645797 0.6841239
4 0.8644034 0.6858183
5 0.8630145 0.6841326
7 0.8435000 0.6431955
9 0.8421667 0.6399868
10 0.8350459 0.6266929
11 0.8393382 0.6383974
12 0.8295507 0.6159840
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
The visualization above is the result of running the trainControl() function with random search. This function specifies the parameters used for training. The visualization shows the results of running the model with 2, 3, 4, 5, 7, 9, 10, 11, and 12 candidate variables selected by random search. The general downward slope of the graph shows performance decreases as the number of candidate variables increases from 2 to 12. The conclusion is that for a random search the model’s accuracy is highest when the number of candidate variables (mtry) is 2.
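The random-search tuning described above might be set up along these lines with caret. This is a sketch under assumptions: iris stands in for the heart failure data, and tuneLength is an illustrative choice; the cross-validation scheme (10-fold, repeated 3 times) matches the resampling summary printed above:

```r
# mtry tuning by random search with caret: repeated 10-fold cross-validation,
# candidate mtry values drawn at random, best model chosen by accuracy.
library(caret)

set.seed(7)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                     search = "random")
fit  <- train(Species ~ ., data = iris, method = "rf",
              metric = "Accuracy", tuneLength = 5, trControl = ctrl)

fit$bestTune$mtry   # the mtry value with the highest cross-validated accuracy
```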
Random Forest
239 samples
12 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 215, 215, 215, 215, 216, 214, ...
Resampling results across tuning parameters:
mtry Accuracy Kappa
1 0.7759589 0.3881508
2 0.8638961 0.6790539
3 0.8596184 0.6759849
4 0.8597343 0.6754085
5 0.8554469 0.6695923
6 0.8568961 0.6719707
7 0.8457295 0.6483644
8 0.8414976 0.6368253
9 0.8401135 0.6331559
10 0.8386087 0.6300973
11 0.8345024 0.6212769
12 0.8274976 0.6082830
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
The visualization above is the result of running trainControl() with a grid search rather than random selection to identify the optimum number of candidate variables (mtry). The results confirm a peak accuracy when mtry is 2. Thus, the randomForest model is tested with this optimized mtry value below.
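A sketch of the grid-search variant, which evaluates every possible mtry value from 1 to 12 rather than a random sample (heart_train is again an assumed name):

```r
library(caret)

# Same repeated 10-fold CV, but with an exhaustive grid over mtry
ctrl_grid <- trainControl(method = "repeatedcv", number = 10,
                          repeats = 3, search = "grid")
rf_grid <- train(DEATH_EVENT ~ ., data = heart_train, method = "rf",
                 metric = "Accuracy",
                 tuneGrid = expand.grid(.mtry = 1:12),
                 trControl = ctrl_grid)
print(rf_grid)
plot(rf_grid)  # accuracy vs. mtry, peaking at mtry = 2
```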
Call:
randomForest(formula = DEATH_EVENT ~ ., data = heart_train, mtry = 2, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 12.97%
Confusion matrix:
0 1 class.error
0 150 11 0.06832298
1 20 58 0.25641026
The above output is the result of running the Random Forest algorithm with an mtry value of 2 rather than the default of 3. Notice the confusion matrix shows a negative-class (0) error rate of 6.83% and a positive-class (1) error rate of 25.64% (vs. 7.45% and 25.64% previously). The OOB estimate of error rate decreased slightly to 12.97% from 13.39%. These results confirm that modest improvements can be achieved when the model is tuned.
0 1 MeanDecreaseAccuracy
age 3.2540020 7.0557417 7.0824979
anaemia -2.9979407 -0.5694724 -2.7130595
creatinine_phosphokinase 1.2985399 -1.6000825 0.1451115
diabetes -0.3129181 -1.7068997 -1.4458052
ejection_fraction 9.9638983 10.5525370 13.6350220
high_blood_pressure -1.4000197 1.3288263 -0.1072657
platelets 0.7175006 0.6922643 0.9023172
serum_creatinine 13.8178045 11.9915047 17.1191782
serum_sodium 2.7946286 4.5633542 4.7913111
sex 1.1950761 -1.4033971 -0.2195930
smoking 1.6537970 -1.2290673 0.5459152
time 32.8832531 32.3490238 37.2963638
MeanDecreaseGini
age 10.045945
anaemia 1.543104
creatinine_phosphokinase 8.356436
diabetes 1.524374
ejection_fraction 11.511277
high_blood_pressure 1.579383
platelets 8.966627
serum_creatinine 15.153737
serum_sodium 8.104233
sex 1.356842
smoking 1.334717
time 32.853615
The Variable Importance plot above is generated from the optimized model. The results appear essentially identical to the previous Variable Importance plot, with no noticeable change in the most important variables. The top three most important variables for classification are still time, serum_creatinine, and ejection_fraction.
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 38 6
1 4 12
Accuracy : 0.8333
95% CI : (0.7148, 0.9171)
No Information Rate : 0.7
P-Value [Acc > NIR] : 0.01388
Kappa : 0.5902
Mcnemar's Test P-Value : 0.75183
Sensitivity : 0.6667
Specificity : 0.9048
Pos Pred Value : 0.7500
Neg Pred Value : 0.8636
Precision : 0.7500
Recall : 0.6667
F1 : 0.7059
Prevalence : 0.3000
Detection Rate : 0.2000
Detection Prevalence : 0.2667
Balanced Accuracy : 0.7857
'Positive' Class : 1
The Confusion Matrix and Statistics above show a slight improvement over the previous results. The correct negative predictions (true negatives) increase from 37 previously to 38, and the false positives decrease from 5 previously to 4. This raises the accuracy from 81.67% to 83.33%, an improvement of about 1.66 percentage points from tuning the model.
After changing the mtry hyperparameter to 2, the OOB estimate of error rate decreased slightly to 12.97% (from 13.39%). The positive-class error stayed the same at 25.64%, and the negative-class error decreased slightly to 6.83% (down from 7.45%). Thus, tuning the model resulted in a small but noticeable improvement.
Call:
summary.resamples(object = results)
Models: 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500
Number of resamples: 30
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
100 0.7083333 0.7916667 0.8965217 0.8598357 0.9166667 0.9583333 0
200 0.7083333 0.7916667 0.8750000 0.8584469 0.9166667 0.9583333 0
300 0.7083333 0.7916667 0.8775000 0.8640072 0.9166667 0.9583333 0
400 0.7083333 0.7916667 0.8940217 0.8653406 0.9166667 0.9583333 0
500 0.7083333 0.7916667 0.9130435 0.8681787 0.9191667 0.9583333 0
600 0.7083333 0.7916667 0.8940217 0.8639469 0.9166667 0.9583333 0
700 0.7083333 0.7916667 0.9130435 0.8626184 0.9166667 0.9583333 0
800 0.7083333 0.7916667 0.9130435 0.8640676 0.9166667 1.0000000 0
900 0.7083333 0.7916667 0.9130435 0.8626787 0.9166667 1.0000000 0
1000 0.7083333 0.7916667 0.9130435 0.8626787 0.9166667 1.0000000 0
1100 0.7083333 0.7916667 0.9130435 0.8612899 0.9166667 1.0000000 0
1200 0.7083333 0.7916667 0.9130435 0.8612899 0.9166667 1.0000000 0
1300 0.7083333 0.7916667 0.9130435 0.8612899 0.9166667 1.0000000 0
1400 0.7083333 0.7916667 0.8940217 0.8612295 0.9166667 1.0000000 0
1500 0.7083333 0.7916667 0.8940217 0.8597802 0.9166667 0.9583333 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
100 0.3225806 0.4984326 0.7457640 0.6732948 0.8125000 0.9090909 0
200 0.3225806 0.4984326 0.7210508 0.6700040 0.8093750 0.9090909 0
300 0.3225806 0.5454545 0.7122532 0.6805222 0.8125000 0.9090909 0
400 0.3225806 0.5454545 0.7431882 0.6843466 0.8125000 0.9090909 0
500 0.3225806 0.5454545 0.7946429 0.6921678 0.8216912 0.9090909 0
600 0.3225806 0.5454545 0.7431882 0.6827694 0.8152574 0.9090909 0
700 0.3225806 0.5454545 0.7856709 0.6795025 0.8125000 0.9090909 0
800 0.3225806 0.5454545 0.7856709 0.6827929 0.8125000 1.0000000 0
900 0.3225806 0.5454545 0.7856709 0.6803552 0.8125000 1.0000000 0
1000 0.3225806 0.5454545 0.7856709 0.6803552 0.8125000 1.0000000 0
1100 0.3225806 0.4984326 0.7856709 0.6767567 0.8125000 1.0000000 0
1200 0.3225806 0.4984326 0.7856709 0.6767567 0.8125000 1.0000000 0
1300 0.3225806 0.4984326 0.7856709 0.6767567 0.8125000 1.0000000 0
1400 0.3225806 0.4984326 0.7519859 0.6765828 0.8125000 1.0000000 0
1500 0.3225806 0.4984326 0.7519859 0.6732924 0.8125000 0.9090909 0
In the code above, the number of trees (ntree) is evaluated over the range 100 to 1500 in increments of 100 trees per set to see whether accuracy can be improved by changing this hyperparameter. Note that the default value used previously is ntree = 500.
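The sweep described above can be sketched as follows (ctrl and heart_train are assumed names carried over from the earlier tuning steps; the actual code may differ):

```r
library(caret)

# Fit one model per ntree value, holding mtry at its tuned value of 2
models <- list()
for (nt in seq(100, 1500, by = 100)) {
  set.seed(123)
  models[[as.character(nt)]] <- train(DEATH_EVENT ~ ., data = heart_train,
                                      method = "rf",
                                      tuneGrid = data.frame(.mtry = 2),
                                      trControl = ctrl, ntree = nt)
}
results <- resamples(models)  # collect the 30 resample estimates per model
summary(results)              # produces the Accuracy/Kappa table above
dotplot(results)              # side-by-side accuracy and kappa comparison
```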
The results are easiest to understand by reviewing the visualization above. The position of the circle in each row of the accuracy plot on the left shows the greatest accuracy is achieved when the number of trees is 500, and the greatest kappa value is also achieved at 500 trees. Cohen's kappa is essentially a measure of "how reliably two raters measure the same thing" (DataTab 2024). Thus, the default of 500 trees used in the original model is also the optimal configuration, and a new model run is not indicated.
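Cohen's kappa compares the observed agreement to the agreement expected by chance. As a minimal worked example in base R, applying the formula to the tuned model's test confusion matrix reproduces the kappa reported by caret earlier:

```r
# Cohen's kappa from the tuned model's test confusion matrix:
#            Reference
# Prediction  0  1
#          0 38  6
#          1  4 12
cm  <- matrix(c(38, 4, 6, 12), nrow = 2)           # filled column-wise
n   <- sum(cm)                                     # 60 test observations
p_o <- sum(diag(cm)) / n                           # observed agreement
p_e <- sum((rowSums(cm) / n) * (colSums(cm) / n))  # chance agreement
kappa <- (p_o - p_e) / (1 - p_e)
round(kappa, 4)  # 0.5902, matching the caret output
```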
0 1
1 0.380 0.620
2 0.122 0.878
3 0.326 0.674
4 0.374 0.626
5 0.248 0.752
6 0.268 0.732
7 0.206 0.794
8 0.374 0.626
9 0.086 0.914
10 0.192 0.808
11 0.272 0.728
12 0.488 0.512
13 0.428 0.572
14 0.904 0.096
15 0.928 0.072
16 0.896 0.104
17 0.908 0.092
18 0.582 0.418
19 0.524 0.476
20 0.928 0.072
21 0.608 0.392
22 0.964 0.036
23 0.892 0.108
24 0.298 0.702
25 0.700 0.300
26 0.506 0.494
27 0.266 0.734
28 0.972 0.028
29 0.954 0.046
30 0.948 0.052
31 0.954 0.046
32 0.900 0.100
33 0.984 0.016
34 0.924 0.076
35 0.794 0.206
36 0.702 0.298
37 0.778 0.222
38 0.904 0.096
39 0.780 0.220
40 0.696 0.304
41 0.932 0.068
42 0.688 0.312
43 0.876 0.124
44 0.706 0.294
45 0.578 0.422
46 0.756 0.244
47 0.984 0.016
48 0.818 0.182
49 0.838 0.162
50 0.532 0.468
51 0.844 0.156
52 0.970 0.030
53 0.434 0.566
54 0.984 0.016
55 0.930 0.070
56 0.730 0.270
57 0.946 0.054
58 0.528 0.472
59 0.934 0.066
60 0.934 0.066
attr(,"class")
[1] "matrix" "array" "votes"
The table above shows the classification prediction probabilities for each observation scored by the tuned model. This output is similar to the output for the default model, and no additional understanding is gained from it; it is included in this report for completeness.
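One way such a table can be produced from a fitted randomForest model (rf_tuned and heart_test are assumed names for the tuned model and test set):

```r
# Per-observation class probabilities (out-of-forest vote fractions)
probs <- predict(rf_tuned, newdata = heart_test, type = "prob")
head(probs)  # two columns of probabilities, one per class ('0' and '1')
```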
[1] "Accuracy % of default random forest: 0.816666666666667"
[1] "Area under curve of default random forest: 0.858465608465608"
[1] "Accuracy % of tuned random forest: 0.833333333333333"
[1] "Area under curve of tuned random forest: 0.843915343915344"
The ROC (receiver operating characteristic) curve above is similar to the previous one except that both the default (orange) and tuned (blue) models are plotted together. As was stated previously, a ROC curve is a visual representation of the performance (not the accuracy) of a model; greater performance corresponds to more area under the curve. In this case, the tuned (blue) and default (orange) curves are approximately the same. Numerical analysis shows the default model performs slightly better, with an area under the curve of 85.85% vs. 84.39% for the tuned model.
Furthermore, the tuned model has slightly better accuracy at 83.33% vs. 81.67% for the default model. The apparent conflict between performance and accuracy is resolved by noting that the two metrics measure different things: the area under the curve summarizes how well the model ranks positive cases above negative ones across all classification thresholds, while accuracy measures correctness at a single threshold.
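A sketch of how the overlaid curves and AUC values can be produced with the pROC package (prob_default and prob_tuned are assumed names for each model's positive-class probabilities on the test set):

```r
library(pROC)

# Build one ROC object per model from the test-set probabilities
roc_default <- roc(heart_test$DEATH_EVENT, prob_default)
roc_tuned   <- roc(heart_test$DEATH_EVENT, prob_tuned)

plot(roc_default, col = "orange")
plot(roc_tuned,   col = "blue", add = TRUE)

auc(roc_default)  # area under the default model's curve
auc(roc_tuned)    # area under the tuned model's curve
```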
Model Results

| Model | OOB Error/Training Acc | Testing Acc | Precision | Recall | F1 Score | Balanced Accuracy | AUC-ROC |
|---|---|---|---|---|---|---|---|
| Model 1 (default) | 13.39%/86.61% | 0.8167 | 0.7059 | 0.6667 | 0.6857 | 0.7738 | 0.8585 |
| Model 2 (Tuned mtry) | 12.97%/87.03% | 0.8333 | 0.7500 | 0.6667 | 0.7059 | 0.7857 | 0.8439 |
The testing accuracies of both models are slightly lower than their respective training accuracies, which indicates some overfitting to the training dataset. We suspect this is caused by the training dataset being too small or too noisy. This is a minor concern, and its resolution is out of scope for this report.
Looking at the default model (Model 1), we obtained an OOB error rate of 13.39%, which translates to a training accuracy of 86.61%. The testing accuracy was 0.8167 (81.67%). The precision is 70.59%, meaning that about 71% of the observations predicted as positive are truly positive. The recall is 66.67%, meaning this model detects about 67% of the true positive cases in the data. The F1 score is 68.57%, the harmonic mean of the model's precision and recall and another measure of the model's reliability. Lastly, the balanced accuracy is 77.38%, the average of the model's true positive rate and true negative rate; this metric weights the minority and majority classes equally in the evaluation.
For the second model (Model 2), the OOB error is 12.97%, which translates to a training accuracy of 87.03%. The testing accuracy is 0.8333 (83.33%). The precision is 0.7500, indicating that observations predicted as positive are truly positive 75% of the time. The recall is 66.67%, meaning this model also detects about 67% of the true positive cases. The F1 score is 70.59%, the harmonic mean of the model's precision and recall. Lastly, the balanced accuracy is 78.57%, the average of the model's true positive rate and true negative rate.
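These metrics can be verified directly from Model 2's confusion matrix counts (TP = 12, FP = 4, FN = 6, TN = 38, with class 1 as the positive class):

```r
# Recomputing Model 2's test metrics from its confusion matrix counts
TP <- 12; FP <- 4; FN <- 6; TN <- 38
precision    <- TP / (TP + FP)                                 # 0.7500
recall       <- TP / (TP + FN)                                 # 0.6667 (sensitivity)
specificity  <- TN / (TN + FP)                                 # 0.9048
f1           <- 2 * precision * recall / (precision + recall)  # 0.7059
balanced_acc <- (recall + specificity) / 2                     # 0.7857
```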
The performance of Model 2 is better than Model 1, based on its training and testing accuracies, precision, F1, and balanced accuracy, all of which are higher than those of the default model. The F1 summarizes overall performance since it combines the precision and recall percentages. The balanced accuracy is the average classification accuracy across both classes. The recall did not change between models. This metric is particularly important in cases such as heart failure prediction: a higher recall means fewer false negatives, typically at the expense of precision (that is, more false positives). It is usually more important not to miss positive cases than to avoid a positive diagnosis that turns out to be a 'false alarm.' The ROC-AUC of the tuned model is slightly lower than the default model's (0.8439 vs. 0.8585), suggesting that although the tuned model is more accurate at the chosen threshold, it ranks cases slightly less well across all thresholds.
Although we know which patients Random Forest classifies as at risk for heart failure, we do not know how Random Forest arrives at these decisions; this is characteristic of black-box machine learning. The variable importance plot provides some indication of which variables influence the classification process, but we cannot know for certain. The variable importance plots for each model reveal that the top three features with the greatest influence on heart failure prediction are the follow-up time, serum creatinine level, and ejection fraction. The patient's age and serum sodium level also influence classification, but to a lesser degree.